Exploring New Languages with HAIRCUT at CLEF 2005

Author

  • Paul McNamee
Abstract

JHU/APL has long espoused the use of language-neutral methods for cross-language information retrieval. This year we participated in the ad hoc cross-language track and submitted both monolingual and bilingual runs. We undertook our first investigations in the Bulgarian and Hungarian languages. In our bilingual experiments we used several nontraditional CLEF query languages such as Greek, Hungarian, and Indonesian, in addition to several western European languages. We found that character n-grams remain an attractive option for representing documents and queries in these new languages. In our monolingual tests n-grams were more effective than unnormalized words for retrieval in Bulgarian (+30%) and Hungarian (+63%). Our bilingual runs made use of subword translation (statistical translation of character n-grams using aligned corpora) when parallel data were available, and web-based machine translation when no suitable data could be found.
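
As an illustration of the character n-gram representation mentioned in the abstract, the following is a minimal sketch, not the HAIRCUT implementation. The function name `char_ngrams`, the default length n=4, lowercasing, and padding with a single space at each end are all assumptions made for this example; the abstract does not specify these details.

```python
def char_ngrams(text, n=4):
    """Return the overlapping character n-grams of `text`.

    Assumptions for this sketch (not taken from the paper):
    - text is lowercased and runs of whitespace collapse to one space;
    - one space is added at each end so word edges are marked;
    - n-grams may span word boundaries.
    """
    normalized = " ".join(text.lower().split())
    padded = f" {normalized} "
    if len(padded) < n:
        # Text shorter than n: return the padded string as a single token.
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    # The same function applies unchanged to any language written with
    # characters, which is what makes the approach language-neutral.
    for gram in char_ngrams("new languages"):
        print(repr(gram))
```

Because the tokenizer needs no stemmer, stopword list, or morphological analyzer, it can be applied to a new language such as Bulgarian or Hungarian without language-specific resources.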

Similar resources

Cross Language Evaluation Forum: CLEF 2005 (Gareth Jones and Carol Peters)

This presentation will report the activities of the CLEF 2005 evaluation campaign. CLEF 2005 consisted of 8 tracks focusing on topics in multilingual information retrieval. An assessment of the results will be given with particular focus on two important tracks: multilingual ad-hoc retrieval and cross-language search in image collections. The multilingual task this year had two objectives: to e...

Ad-hoc Mono- and Bilingual Retrieval Experiments at the University of Hildesheim

This paper reports on our participation in CLEF 2005's ad-hoc multi-lingual retrieval track. The ad-hoc task introduced Bulgarian and Hungarian as new languages. Our experiments focus on the two new languages. Naturally, no relevance assessments are available for these collections yet. Optimization was mainly based on French data from last year. Based on experience from last year, one of our ma...

Combining Passages in the Monolingual Task with the IR-n System

This paper describes our participation in monolingual tasks at CLEF-2005. In this research we have worked in the following languages: English, French, Portuguese, Bulgarian and Hungarian. Our task has focused on combining passages of different sizes to improve the Information Retrieval process. Once we have studied the experiments which have been carried out and the official results at CL...

Cross-Language Retrieval Using HAIRCUT for CLEF 2004

JHU/APL continued to explore the use of knowledge-light methods for scalable multilingual retrieval during the CLEF 2004 evaluation. We relied on the language-neutral techniques of character n-gram tokenization, pre-translation query expansion, statistical translation using aligned parallel corpora, fusion from disparate retrievals, and reliance on language similarity when resources are scarce....

Dublin City University at CLEF 2005: Cross-Language Spoken Document Retrieval (CL-SR) Experiments

The Dublin City University participation in the CLEF CL-SR 2005 task concentrated on exploring the application of our existing information retrieval methods based on the Okapi model to the conversational speech data set. This required an approach to determining approximate sentence boundaries within the free-flowing automatic transcription provided. We also performed exploratory experiments on ...

Publication date: 2005